The higher-order structure of a protein determines its function.
1,937 human proteins have an unknown role (the dark proteome) (Paik et al. 2018).
Aim: to develop methods for predicting protein properties from their primary structure in a way that is understandable to biologists and experimentally validated.
n-grams (k-tuple, k-mers):
Amino acid sequences are encoded as n-grams for use in machine learning.
Peptide I: FKVWPDHGSG
Peptide II: YMCIYRAQTN
n-gram examples from peptide I and II:
Longer n-grams are more informative, but create larger attribute spaces that are more difficult to analyze.
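The two points above can be made concrete with a minimal Python sketch. It extracts only contiguous n-grams (gapped or position-specific variants are not covered), and the helper names `extract_ngrams` and `attribute_space_size` are hypothetical, not part of any package mentioned in these slides.

```python
from collections import Counter


def extract_ngrams(sequence, n):
    """Return all contiguous n-grams of `sequence`, in order of appearance."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def attribute_space_size(n, alphabet_size=20):
    """Number of distinct n-grams possible over an alphabet of the given size."""
    return alphabet_size ** n


peptide_1 = "FKVWPDHGSG"
peptide_2 = "YMCIYRAQTN"

# Count how often each 2-gram occurs in peptide I.
bigram_counts = Counter(extract_ngrams(peptide_1, 2))

for n in (1, 2, 3):
    # A 10-residue peptide yields at most 10 - n + 1 n-grams,
    # while the number of possible n-grams grows as 20 ** n.
    print(n, len(extract_ngrams(peptide_1, n)), attribute_space_size(n))
```

Each peptide yields at most nine 2-grams out of 400 possible ones, which is why the resulting count tables are mostly zeros.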
Counting n-grams produces sparse matrices, which cause dimensionality problems.
| Number of sparse matrices | Package | File size [Mb] |
|---|---|---|
| 1 | base | 0.000214 |
| 1 | slam | 0.001122 |
| 10 | base | 0.000969 |
| 10 | slam | 0.001312 |
| 100 | base | 0.0765 |
| 100 | slam | 0.002625 |
| 1000 | base | 7.629601 |
| 1000 | slam | 0.016357 |
| 10000 | base | 762.939659 |
| 10000 | slam | 0.153687 |
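The gap between dense and sparse storage in the table can be illustrated with a small Python sketch that stores only the non-zero counts as (row, column, value) triplets. This is the general idea behind triplet-based sparse formats, not the actual slam implementation; `count_ngrams_triplets` is a hypothetical helper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_ngrams_triplets(sequences, n):
    """Count contiguous n-grams and keep only non-zero cells as (row, col, value)."""
    # One column per possible n-gram over the 20 standard amino acids.
    columns = {"".join(p): j for j, p in enumerate(product(AMINO_ACIDS, repeat=n))}
    counts = {}
    for i, seq in enumerate(sequences):
        for k in range(len(seq) - n + 1):
            key = (i, columns[seq[k:k + n]])
            counts[key] = counts.get(key, 0) + 1
    triplets = [(i, j, v) for (i, j), v in sorted(counts.items())]
    return triplets, (len(sequences), len(columns))


triplets, shape = count_ngrams_triplets(["FKVWPDHGSG", "YMCIYRAQTN"], 2)

dense_cells = shape[0] * shape[1]   # every cell stored, zeros included
sparse_numbers = 3 * len(triplets)  # three numbers per non-zero cell
print(shape, dense_cells, sparse_numbers)
```

For very small inputs the triplet layout carries overhead, consistent with the first rows of the table; the saving appears as the number of mostly empty cells grows.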
Quick Permutation Test (QuiPT) is a fast alternative to permutation tests for n-gram data. It also allows precise estimation of p-values.
QuiPT is available as part of the biogram R package.
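As a rough illustration of how a permutation test can be replaced by an exact calculation, the Python sketch below assumes a binary n-gram feature (present or absent), a binary target, and information gain as the test statistic. Under label permutation the number of positive sequences containing the n-gram follows a hypergeometric distribution, so the p-value can be summed directly instead of being estimated from shuffles. The choice of statistic and all function names are assumptions made for this sketch; it is not the biogram code.

```python
from math import comb, log2


def binary_entropy(positives, total):
    """Entropy (in bits) of a binary label with `positives` ones among `total` items."""
    if total == 0 or positives == 0 or positives == total:
        return 0.0
    p = positives / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(k, n_feature, n_pos, n_total):
    """Information gain when k of the n_feature sequences with the n-gram are positive."""
    n_absent = n_total - n_feature
    conditional = ((n_feature / n_total) * binary_entropy(k, n_feature)
                   + (n_absent / n_total) * binary_entropy(n_pos - k, n_absent))
    return binary_entropy(n_pos, n_total) - conditional


def exact_p_value(k_observed, n_feature, n_pos, n_total):
    """Probability, under random labelling, of a gain at least as large as observed."""
    k_min = max(0, n_feature - (n_total - n_pos))
    k_max = min(n_feature, n_pos)
    observed = information_gain(k_observed, n_feature, n_pos, n_total)
    n_tables = comb(n_total, n_feature)
    p_value = 0.0
    for k in range(k_min, k_max + 1):
        # Small tolerance so that symmetric tables with equal gain are both counted.
        if information_gain(k, n_feature, n_pos, n_total) >= observed - 1e-12:
            p_value += comb(n_pos, k) * comb(n_total - n_pos, n_feature - k) / n_tables
    return p_value


# 20 sequences, 10 positive; the n-gram occurs in 5 sequences, all of them positive.
print(exact_p_value(5, 5, 10, 20))
```

Because every possible table is enumerated, the resolution of the p-value is not limited by the number of permutations drawn.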
The following peptides appear to be completely different in terms of amino acid composition.
Peptide I:
FKVWPDHGSG
Peptide II:
YMCIYRAQTN
| Group | Amino acids |
|---|---|
| 1 | C, I, L, K, M, F, P, W, Y, V |
| 2 | A, D, E, G, H, N, Q, R, S, T |
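The grouping in the table above amounts to a simple lookup from amino acid to group label. A minimal Python sketch, with the hypothetical helper name `encode`:

```python
# Two-group reduced alphabet taken from the table above.
GROUPS = {
    "1": "CILKMFPWYV",
    "2": "ADEGHNQRST",
}

# Invert the table: amino acid -> group label.
ENCODING = {aa: label for label, members in GROUPS.items() for aa in members}


def encode(sequence):
    """Rewrite a peptide using group labels instead of amino acid letters."""
    return "".join(ENCODING[aa] for aa in sequence)


print(encode("FKVWPDHGSG"))
print(encode("YMCIYRAQTN"))
```

Both peptides map to the same string, so n-grams counted in the reduced alphabet treat them as identical even though they share no amino acid.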
Peptide I: FKVWPDHGSG → 1111122222
Peptide II: YMCIYRAQTN → 1111122222
Amyloid aggregates are found in tissues of people suffering from neurodegenerative disorders such as Alzheimer’s disease and Parkinson’s disease, among many others.
Amyloid aggregates (red) around neurons (green). Strittmatter Laboratory, Yale University.
Source: National Institute on Aging (NIA) | National Institutes of Health (NIH)
Peptide sequences with amyloidogenic properties (hot spots) are responsible for the aggregation of amyloidogenic proteins (Sawaya et al. 2007).
Burdukiewicz, M., Sobczyk, P., Rödiger, S., Duda-Madej, A., Mackiewicz, P., and Kotulska, M. (2017). Amyloidogenic motifs revealed by n-gram analysis. Scientific Reports 7, 12961
| Package | Runtime [h], mtry = 5000 | Runtime [h], mtry = 15,000 | Runtime [h], mtry = 135,000 | Memory usage [GB] |
|---|---|---|---|---|
| randomForest | 101.24 | 116.15 | 248.60 | 39.05 |
| randomForest (MC) | 32.10 | 53.84 | 110.85 | 105.77 |
| bigrf | NA | NA | NA | NA |
| randomForestSRC | 1.27 | 3.16 | 14.55 | 46.82 |
| Random Jungle | 1.51 | 3.60 | 12.83 | 0.40 |
| Rborist | NA | NA | NA | >128 |
| ranger | 0.56 | 1.05 | 4.58 | 11.26 |
| ranger (save.memory) | 0.93 | 2.39 | 11.15 | 0.24 |
| ranger (GWAS mode) | 0.23 | 0.51 | 2.32 | 0.23 |
Marvin N. Wright and Andreas Ziegler. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77(1)
Do standard reduced alphabets, developed for other biological problems, help to improve amyloid prediction?
Standard reduced amino acid alphabets do not improve the quality of amyloid prediction.
17 measures handpicked from the AAIndex database, covering:
size of residues,
hydrophobicity,
solvent surface area,
frequency in \(\beta\)-sheets,
contactivity.
524,284 reduced amino acid alphabets with different levels of reduction (three to six amino acid groups).
For each category the alphabets have been ranked (rank 1 for the best AUC, etc.).
The best alphabet was the one with the lowest rank sum.
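The rank-sum selection described above can be sketched in Python as follows. The alphabet names, category names and AUC values are made-up placeholders for illustration only, not results from the study, and `best_by_rank_sum` is a hypothetical helper.

```python
def best_by_rank_sum(auc_by_category):
    """Rank alphabets within each category (1 = highest AUC) and sum the ranks."""
    rank_sums = {}
    for aucs in auc_by_category.values():
        for name, auc in aucs.items():
            # Rank = 1 + number of alphabets with a strictly higher AUC.
            rank = 1 + sum(other > auc for other in aucs.values())
            rank_sums[name] = rank_sums.get(name, 0) + rank
    best = min(rank_sums, key=rank_sums.get)
    return best, rank_sums


# Placeholder AUC values, invented purely to exercise the function.
toy_aucs = {
    "category_1": {"A": 0.80, "B": 0.85, "C": 0.70},
    "category_2": {"A": 0.82, "B": 0.78, "C": 0.75},
    "category_3": {"A": 0.79, "B": 0.81, "C": 0.83},
}

best, rank_sums = best_by_rank_sum(toy_aucs)
print(best, rank_sums)
```

In this toy case alphabet B wins without having the top AUC in every category, because it is never ranked last.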
| Group | Amino acids |
|---|---|
| 1 | G |
| 2 | K, P, R |
| 3 | I, L, V |
| 4 | F, W, Y |
| 5 | A, C, H, M |
| 6 | D, E, N, Q, S, T |
Groups 3 and 4: hydrophobic amino acids.
Group 2: amino acids disrupting the \(\beta\)-structure (\(\beta\)-breakers).
Is the best-performing reduced amino acid alphabet associated with amyloidogenicity?
The similarity index (Stephenson and Freeland 2013) measures the similarity between two reduced alphabets (1: identical alphabets, 0: completely dissimilar alphabets).
The correlation between the similarity index and the average AUC is significant (\(\textrm{p-value} \leq 2.2 \times 10^{-16}\); \(\rho = 0.51\)).
Are informative n-grams found by QuiPT associated with amyloidogenicity?
Of the 65 most informative n-grams, 15 (23%) are also present in amino acid motifs found experimentally (Paz and Serrano 2004).
| Program | AUC | MCC |
|---|---|---|
| AmyloGram | 0.8972 | 0.6307 |
| PASTA 2.0 (Walsh et al. 2014) | 0.8550 | 0.4291 |
| FoldAmyloid (Garbuzynskiy, Lobanov, and Galzitskaya 2010) | 0.7351 | 0.4526 |
| APPNN (Família et al. 2015) | 0.8343 | 0.5823 |
AmyloGram, the classifier trained using the best reduced alphabet, was compared with other amyloid prediction tools on an external dataset.
MCC (Matthews correlation coefficient) measures the performance of a classifier (1: the classifier always recognizes amyloid proteins correctly; -1: the classifier never recognizes amyloid proteins correctly).
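For reference, MCC can be computed from the four cells of a confusion matrix. The Python sketch below uses the standard formula; returning 0 when the denominator is zero is a common convention and an assumption here, and the function name `mcc` is only illustrative.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Undefined when a whole row or column of the matrix is empty; use 0 by convention.
        return 0.0
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=10, tn=10, fp=0, fn=0))   # every prediction correct
print(mcc(tp=0, tn=0, fp=10, fn=10))   # every prediction wrong
print(mcc(tp=5, tn=5, fp=5, fn=5))     # no better than chance
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when amyloid and non-amyloid sequences are unevenly represented.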
A new functional amyloid produced by Methanospirillum sp. (Christensen et al. 2018) was selected for analysis by AmyloGram.
Models that predict protein properties can be based on precise rules that are understandable to biologists and experimentally verifiable, without losing their effectiveness.
Michał Burdukiewicz (Warsaw University of Technology).
Małgorzata Kotulska (Wrocław University of Science and Technology).
Stefan Rödiger (Brandenburg University of Technology Cottbus-Senftenberg).
Paweł Mackiewicz (University of Wrocław).
Piotr Sobczyk (Wrocław University of Science and Technology).
Funding:
Polish National Science Centre (2015/17/N/NZ2/01845 & 2017/24/T/NZ2/00003).
COST ACTION CA15110 (Harmonising standardisation strategies to increase efficiency and competitiveness of European life-science research).
KNOW Wrocław Center for Biotechnology.
German Federal Ministry of Education and Research (InnoProfile-Transfer-Projekt 03IPT611X).
Web servers:
R packages:
Burdukiewicz, Michał, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, and Małgorzata Kotulska. 2016. “Prediction of Amyloidogenicity Based on the N-Gram Analysis.” e2390v1. PeerJ Preprints. https://peerj.com/preprints/2390.
Burdukiewicz, Michał, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, and Małgorzata Kotulska. 2017. “Amyloidogenic Motifs Revealed by N-Gram Analysis.” Scientific Reports 7 (1): 12961. doi:10.1038/s41598-017-13210-9.
Christensen, Line Friis Bakmann, Lonnie Maria Hansen, Kai Finster, Gunna Christiansen, Per Halkjær Nielsen, Daniel Erik Otzen, and Morten Simonsen Dueholm. 2018. “The Sheaths of Methanospirillum Are Made of a New Type of Amyloid Protein.” Frontiers in Microbiology 9: 2729. doi:10.3389/fmicb.2018.02729.
Família, Carlos, Sarah R. Dennison, Alexandre Quintas, and David A. Phoenix. 2015. “Prediction of Peptide and Protein Propensity for Amyloid Formation.” PLOS ONE 10 (8): e0134679. doi:10.1371/journal.pone.0134679.
Garbuzynskiy, Sergiy O., Michail Yu Lobanov, and Oxana V. Galzitskaya. 2010. “FoldAmyloid: A Method of Prediction of Amyloidogenic Regions from Protein Sequence.” Bioinformatics (Oxford, England) 26 (3): 326–32. doi:10.1093/bioinformatics/btp691.
Murphy, Lynne Reed, Anders Wallqvist, and Ronald M. Levy. 2000. “Simplified Amino Acid Alphabets for Protein Fold Recognition and Implications for Folding.” Protein Engineering 13 (3): 149–52. doi:10.1093/protein/13.3.149.
Paz, Manuela López de la, and Luis Serrano. 2004. “Sequence Determinants of Amyloid Fibril Formation.” Proceedings of the National Academy of Sciences 101 (1): 87–92. doi:10.1073/pnas.2634884100.
Sawaya, Michael R., Shilpa Sambashivan, Rebecca Nelson, Magdalena I. Ivanova, Stuart A. Sievers, Marcin I. Apostol, Michael J. Thompson, et al. 2007. “Atomic Structures of Amyloid Cross-β Spines Reveal Varied Steric Zippers.” Nature 447 (7143): 453–57. doi:10.1038/nature05695.
Stephenson, James D., and Stephen J. Freeland. 2013. “Unearthing the Root of Amino Acid Similarity.” Journal of Molecular Evolution 77 (4): 159–69. doi:10.1007/s00239-013-9565-0.
Walsh, Ian, Flavio Seno, Silvio C. E. Tosatto, and Antonio Trovato. 2014. “PASTA 2.0: An Improved Server for Protein Aggregation Prediction.” Nucleic Acids Research 42 (W1): W301–W307. doi:10.1093/nar/gku399.